This paper summarizes the application of a stepwise regression model, preceded by a boxplot
pre-processing analysis, to four chemical compounds across four quality grades of agarwood oil.
In the global market, agarwood oil is regarded as an expensive and valuable natural product
owing to its benefits. Unfortunately, there is no standard method for classifying agarwood oil
grades. An intelligent model for grading agarwood oil quality is therefore an important step
towards classifying agarwood quality. The main model chosen in this study is stepwise
regression, assessed through one specific parameter, the coefficient of determination, R².
To achieve this goal, four of the eleven significant compounds of agarwood oil, measured in
660 data samples of low, medium low, medium high and high quality, serve as the input. The
independent variables are X1, X2, X3 and X4, which refer to the γ-eudesmol,
10-epi-γ-eudesmol, β-agarofuran and dihydrocollumellarin compounds, respectively. MATLAB
R2015a was chosen as the simulation platform for this research work. The results show that
the stepwise regression model has an R² of 0.756 and a p-value below the 0.05 significance
level, which passes the performance criteria for regression.


This is an open access article under the CC BY-SA license. 


Corresponding Author:
Zakiah Mohd Yusoff
Department of System, School of Electrical Engineering, College of Engineering
Universiti Teknologi MARA
Cawangan Johor, Kampus Pasir Gudang, 81750, Masai, Johor, Malaysia
Email: zakiah9018@uitm.edu.my


1. INTRODUCTION 


Agarwood oil is a particularly valuable natural product used in incense, fragrance, shampoo,
traditional medicine (healing stomach complaints, diarrhoea, and lung and liver pain), and
perfumery (especially the dark-coloured oils) [1]-[5]. In fact, the oil from the infected heartwood of
Aquilaria species is the most expensive in the market [3], [5], [6]. The use of agarwood oil for a
variety of purposes has recently grown in popularity. Market statistics indicate strong growth in
purchases of agarwood oil in Middle Eastern countries (United Arab Emirates, Saudi Arabia), China
and Japan. Along with this growth in consumer sales, the need for agarwood oil production has
increased, including in Malaysia [3], [7], [8]. To cope with that, the number of agarwood plantations
in Malaysia is expanding rapidly in Perak, Terengganu, Kelantan and Pahang, and indeed around the
country [9], [10].
Agarwood oils are traded and priced differently depending on grade, which ranges across low,
medium low, medium high and high quality [11], [12]. Traditionally, agarwood oil has long been
graded by a human sensory panel on the basis of its colour, resin content, taste, long-lasting
aroma, and density [10], [11]. However, the sensory evaluation method is imperfect: grading
essential oils by human senses cannot guarantee their purity or quality. The trained human grader
technique has a considerable disadvantage in objectivity and consistency when working with
multiple samples at once, and it is a labor-intensive and time-consuming procedure [1], [11], [13].
Many innovative technologies for enhancing the stability and availability of essential oils have
emerged to address these inadequacies of essential oil products [14], [15]. Traditional systems are
gradually being superseded by modern quality-assessment grading systems with new indexes that
move from qualitative to quantitative analysis [10]. A scientific method is an alternative solution
to the grading issue. With new developments in data analysis, there are several platforms on which
agarwood oil quality classification can be done solely from chemical profiles using intelligent
methods, so that essential oils can be assigned to their respective classes (low, medium low,
medium high, or high quality) and the findings measured accurately. Several researchers have
proposed machine learning techniques to verify the quality of agarwood oil, such as artificial
neural network (ANN), linear regression, k-nearest neighbor (k-NN), self-organizing map (SOM) and
one-versus-one (OVO) multiclass support vector machine (SVM) [1], [13], [16]-[18]. A study of seven
agarwood oil chemical compounds (β-agarofuran, 10-epi-γ-eudesmol, γ-eudesmol, eudesmol,
hexadecanol, α-agarofuran and longifolol) reported a correlation coefficient, R, equal to 1 with
two hidden neurons, which gave the best performance because its mean squared error (MSE) of
7.69x10°! was the lowest among the neuron counts tested [16]. The SOM model found that three
chemical compounds, α-agarofuran, β-agarofuran and 10-epi-γ-eudesmol, were significant compounds
for agarwood oil [13]. The majority of previous research work covers only two qualities, low and
high.

Hence, building on the intelligent techniques mentioned above, this research work focuses on
using stepwise regression instead of linear regression as the main model to classify the grade of
agarwood oil into four qualities (low, medium low, medium high, and high quality), as suggested in
[16]. The advantage of the stepwise regression approach is that it improves the efficacy of agarwood
grading by sorting the significant chemical compounds, and it can produce results for four different
qualities that other researchers can use. An intelligent model can also overcome the shortcomings
of the traditional method in time consumption and consistency.


2. THEORETICAL WORK 
2.1. Stepwise regression model 

Stepwise regression analysis is a multiple regression analysis method. Multi-stepwise regression
analysis is regarded as one of the most reliable mathematical statistical methods in scientific
research [19], [20]; it can sort and analyze the quantitative dependence between one dependent
variable and multiple independent variables. Regression analysis is used to study the
interdependence of multiple variables, while stepwise regression analysis is frequently used to find
the most appropriate regression model so that this interdependence can be studied in more depth.
Stepwise procedures, which add or delete one variable at a time, have been the favoured methods
[21].

The approach is to introduce the independent variables into the regression equation one at a time
based on their influence on the dependent variable [22]. Stepwise regression reduces the set of
independent variables, X, using two processes, forward selection and backward elimination [23],
[24]. In forward selection, an independent variable is added to the regression equation only when
it is statistically significant. At the same time, a significance test is performed on each
independent variable already introduced, and variables that are not statistically significant are
eliminated one by one. By repeating this process, the most important variables among the numerous
independent variables are finally chosen, and a regression equation (mathematical model) that
reasonably reflects the relationship between the independent variables and the dependent variable
is established [25]. For stepwise regression, the F-test and its p-value are commonly used as the
test criterion [26]. In statistics, the standard significance threshold is a p-value of less than
0.05 [27].
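
The forward and backward steps described above can be illustrated with a minimal Python sketch.
It is not the MATLAB routine used in this work. As a simplifying assumption, it replaces the exact
p-value test with a fixed partial F cut-off of 3.84, the large-sample 5% critical value for one
numerator degree of freedom. The function names (sse_of_fit, stepwise) and the toy factorial data
are hypothetical, and the sketch does not guard against a perfect fit (zero residual error).

```python
from itertools import product


def sse_of_fit(rows, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the chosen columns."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(cols) + 1
    # Augmented normal equations [X'X | X'y].
    aug = [
        [sum(d[i] * d[j] for d in design) for j in range(k)]
        + [sum(d[i] * yi for d, yi in zip(design, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(aug[r][i]))
        aug[i], aug[pivot] = aug[pivot], aug[i]
        for r in range(k):
            if r != i:
                factor = aug[r][i] / aug[i][i]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[i])]
    beta = [aug[i][k] / aug[i][i] for i in range(k)]
    return sum(
        (yi - sum(b * v for b, v in zip(beta, d))) ** 2 for d, yi in zip(design, y)
    )


def stepwise(rows, y, f_enter=3.84, f_remove=3.84):
    """Return the selected column indices in order of entry."""
    n, selected = len(y), []
    while True:
        changed = False
        # Forward step: add the excluded variable with the largest partial F above f_enter.
        base = sse_of_fit(rows, y, selected)
        df = n - len(selected) - 2  # error degrees of freedom after adding one variable
        best, best_f = None, f_enter
        for c in range(len(rows[0])):
            if c not in selected:
                sse = sse_of_fit(rows, y, selected + [c])
                f_stat = (base - sse) / (sse / df)
                if f_stat > best_f:
                    best, best_f = c, f_stat
        if best is not None:
            selected.append(best)
            changed = True
        # Backward step: drop the included variable with the smallest partial F below f_remove.
        full = sse_of_fit(rows, y, selected)
        df = n - len(selected) - 1
        worst, worst_f = None, f_remove
        for c in selected:
            reduced = sse_of_fit(rows, y, [s for s in selected if s != c])
            f_stat = (reduced - full) / (full / df)
            if f_stat < worst_f:
                worst, worst_f = c, f_stat
        if worst is not None:
            selected.remove(worst)
            changed = True
        if not changed:
            return selected


# Toy data (hypothetical): two replicates of a 2x2x2 factorial design.
# The response depends on columns 0 and 2 only; column 1 is irrelevant.
rows = [list(levels) for levels in product((-1.0, 1.0), repeat=3)] * 2
y = [2 * a + 3 * c + 0.1 * a * b * c for a, b, c in rows]
print(stepwise(rows, y))  # [2, 0]
```

On this toy design, column 2 enters first because it has the largest partial F, column 0 enters
second, and column 1 never passes the entry test.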


3. METHOD 
3.1. Data preparation 

The agarwood oil dataset used in this research was obtained from a previous researcher and
consists of 660 samples across four different qualities (low, medium low, medium high and high
quality). Of these, 210 samples are low quality, 90 are medium low quality, 30 are medium high
quality and the remaining 330 are high quality [13]. There are eleven compounds:
dihydrocollumellarin, γ-eudesmol, α-guaiene, β-agarofuran, 10-epi-γ-eudesmol, ar-curcumene,
γ-cadinene, valerianol, α-agarofuran, allo-aromadendrene epoxide and β-dihydroagarofuran. The
best four of the eleven agarwood oil chemical compounds were chosen based on the boxplot analysis.
The selected chemical compounds employed in stepwise regression are γ-eudesmol, 10-epi-γ-eudesmol,
β-agarofuran and dihydrocollumellarin, as highlighted in Table 1. All simulations were carried out
using MATLAB R2015a.
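
As an illustration of the statistics that a boxplot summarizes, the following Python sketch
computes the five-number summary (minimum, quartiles, median, maximum) of one compound's abundance
for each quality grade. The abundance values and the name five_number_summary are hypothetical,
and the rule the authors applied to the boxplots to pick the four compounds is not reproduced here.

```python
import statistics

# Hypothetical abundances (%) of one compound, grouped by quality grade.
abundance_by_grade = {
    "low": [0.58, 0.60, 0.64, 0.66, 0.70],
    "high": [0.35, 0.38, 0.41, 0.44, 0.47],
}


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile and maximum of a sample."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


summaries = {g: five_number_summary(v) for g, v in abundance_by_grade.items()}
for grade, summary in summaries.items():
    print(grade, summary)
```

A compound whose medians differ clearly between grades, as in this made-up example, is a natural
candidate marker; a compound with nearly identical medians in every grade is not.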


Table 1. Compilation of the boxplot results based on the median abundance (%) of each chemical
compound


γ-eudesmol                    0.64   0.55   0.63   0.41
10-epi-γ-eudesmol             0.76   0.5    0.65   0.41
β-agarofuran                  0.53   0.99   0.32   0.01
dihydrocollumellarin
α-guaiene                     0.26   0.01   0.17   0.01
ar-curcumene                  0.5    0.01   0.01   0.76
β-dihydroagarofuran           0.33   0.01   0.01   0.01
γ-cadinene                    0.01   0.18   0.23   0.41
α-agarofuran                  0.68   0.01   0.27   0.01
allo-aromadendrene epoxide    0.01   0.01   0.01   0.36
valerianol                    0.35   0.01   0.01   0.01


3.2. Flowchart of experimental set-up 

First, the experiment starts with an empty set of variables, as shown in the flowchart in
Figure 1. With no variables yet in the stepwise regression model, the p-value of each candidate
variable is computed. Stepwise regression involves two processes, forward selection and backward
elimination. The p-value of each independent variable of agarwood oil is observed in detail. An
independent variable X is added to the model only if its p-value is less than 0.05; otherwise it
is not added, and forward selection continues until an X variable enters the regression model.
Next, backward elimination looks for X variables in the model whose p-value is greater than 0.05.
Such a variable is removed; otherwise, the selected variables are kept in the model. The selected
variables are the output features of the regression model and serve as the input features, or
markers, for other intelligent models. Equations (1)-(4) give the proposed calculations for the
degrees of freedom, coefficient of determination, adjusted R² and root mean square error (RMSE).


Degrees of freedom, $DF = P - Q$ (1)

Coefficient of determination, $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$ (2)

Adjusted $R^2 = 1 - \left(\frac{P - 1}{P - Q}\right) \times \frac{SSE}{SST}$ (3)

$RMSE = \sqrt{\frac{SSE}{DF}}$ (4)

where
P = number of rows in the agarwood oil sample data
DF = error degrees of freedom
Q = number of coefficients
R² = coefficient of determination
SSR = regression sum of squares
SST = total sum of squares
SSE = error (residual) sum of squares
RMSE = estimate of the standard deviation of the error distribution
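
As a quick arithmetic check, the following Python sketch evaluates (1)-(4) with the sums of squares
reported in Table 3. Two assumptions are made: SST is taken as SSR + SSE, and the adjusted R² uses
the standard form 1 - (P - 1)/(P - Q) × SSE/SST, which is presumed to be what (3) intends.

```python
import math

P, Q = 660, 5                  # number of samples and number of coefficients
SSR, SSE = 915.9689, 294.9402  # sums of squares as reported in Table 3
SST = SSR + SSE                # assumption: total = regression + error

df = P - Q                                  # equation (1)
r2 = 1 - SSE / SST                          # equation (2)
adj_r2 = 1 - (P - 1) / (P - Q) * SSE / SST  # equation (3), standard form
rmse = math.sqrt(SSE / df)                  # equation (4)

print(df, round(r2, 3), round(adj_r2, 3), round(rmse, 3))  # 655 0.756 0.755 0.671
```

The degrees of freedom, R² and RMSE obtained this way agree with the values listed in Table 3.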
Figure 1. Detailed experiment of the stepwise regression model


4. RESULTS AND DISCUSSION 

This section presents the results obtained by training the regression model on the four
qualities of agarwood oil samples: low, medium low, medium high and high quality. As specified in
the methodology section, 660 samples of agarwood oil were employed in the analytical procedure.
The data samples were then evaluated and developed into a stepwise regression model.


4.1. Generating the stepwise regression

As can be seen in Figure 2, the independent variables enter the model in the order X3, X2, X1 and
X4, which refer to the compounds β-agarofuran, 10-epi-γ-eudesmol, γ-eudesmol and
dihydrocollumellarin, respectively. All the independent variables were selected by observing the
p-value of each variable, and all of them have a p-value below the 0.05 significance level. The
estimated coefficients of the predicted output for agarwood oil are tabulated in Table 2. Among
the four predictors, the largest p-value is 2.0551 × 10⁻⁶, for X4, and the smallest is
2.2078 × 10⁻⁷⁴, for X3.

The summary in Table 3 indicates that the R² value is 0.756, or 75.6%. According to the theory,
the coefficient of determination, R², should have a value below 80% for the best fit of stepwise
regression, which this value satisfies. The model has five coefficients: the intercept and the
four selected X variables. The overall p-value for the F-test was 3.27 × 10⁻¹⁹⁹ < 0.05. The four
chemical compounds measured across the four agarwood oil qualities, γ-eudesmol, 10-epi-γ-eudesmol,
β-agarofuran and dihydrocollumellarin, therefore all passed the performance criteria for stepwise
regression.
1. Adding x3, FStat = 1090.2348, pValue = 9.4581127e-142
2. Adding x2, FStat = 245.8135, pValue = 2.691565e-47
3. Adding x1, FStat = 56.7161, pValue = 1.67277e-13
4. Adding x4, FStat = 22.953, pValue = 2.0551e-06


Figure 2. Stepwise regression output generated from the MATLAB command window 


Table 2. Estimated coefficients of stepwise regression


              Estimate value   Standard error, SE   t-statistic   P-value
(Intercept)   -1.196           0.094016             -12.722       2.7332 × 10⁻³³
X1            0.49927          0.097009             5.1466        3.5111 × 10⁻⁷
X2            2.5986           0.15815              16.431        4.6661 × 10⁻⁵¹
X3            2.3084           0.11077              20.84         2.2078 × 10⁻⁷⁴
X4            0.39217          0.081858             4.7909        2.0551 × 10⁻⁶
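
The t-statistics in Table 2 can be cross-checked from the first two columns, since each
t-statistic is the estimate divided by its standard error. The short Python sketch below does
this; the dictionary layout is only an illustrative choice, and small differences in the last
digit come from rounding in the published estimates.

```python
# (estimate, standard error) pairs as reported in Table 2.
coefficients = {
    "(Intercept)": (-1.196, 0.094016),
    "X1": (0.49927, 0.097009),
    "X2": (2.5986, 0.15815),
    "X3": (2.3084, 0.11077),
    "X4": (0.39217, 0.081858),
}

# t-statistic = estimate / standard error
t_stats = {name: est / se for name, (est, se) in coefficients.items()}
for name, t in t_stats.items():
    print(f"{name}: t = {t:.3f}")
```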


Table 3. Summary output of stepwise regression


Parameter                        Values
Number of coefficients, Q        5
Degrees of freedom, DF           655
Root mean squared error, RMSE    0.671
R²                               0.756
P-value                          3.27 × 10⁻¹⁹⁹
SSR                              915.9689
SST                              1.2109 × 10³
SSE                              294.9402
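
The overall F-test can be recomputed from the sums of squares in Table 3, as in the Python sketch
below. It relies on a closed form that holds only because the model has four predictors: the
upper-tail probability of an F(4, DF) variable equals x^a × (1 + a(1 - x)), with a = DF/2 and
x = SSE/SST. Taking SST as SSR + SSE is an assumption of the sketch.

```python
SSR, SSE = 915.9689, 294.9402  # sums of squares as reported in Table 3
Q, DF = 5, 655                 # number of coefficients and error degrees of freedom

# Overall F statistic: regression mean square over error mean square.
f_stat = (SSR / (Q - 1)) / (SSE / DF)

# Upper-tail probability of F(4, DF); this closed form is valid for 4 numerator df only.
x = SSE / (SSR + SSE)
a = DF / 2
p_value = x ** a * (1 + a * (1 - x))

print(f"F = {f_stat:.1f}, p = {p_value:.3g}")
```

The F statistic is close to 508.5, and the resulting p-value is of the order of 10⁻¹⁹⁹, far below
the 0.05 significance level.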


The preliminary linear regression model is given by (5):
Y ~ 1 + X1 + X2 + X3 + X4 (5)


Hence, Y = -1.196 + 0.49927X1 + 2.5986X2 + 2.3084X3 + 0.39217X4
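
As a usage sketch, the fitted equation can be wrapped in a small Python function. The function
name predict_quality_score and the input abundances are hypothetical, and how the continuous
output Y is mapped to the four quality grades is not specified in this section, so the sketch
stops at the raw score.

```python
def predict_quality_score(x1, x2, x3, x4):
    """Evaluate the fitted equation with the coefficients from Table 2.

    x1: gamma-eudesmol, x2: 10-epi-gamma-eudesmol,
    x3: beta-agarofuran, x4: dihydrocollumellarin (abundances).
    """
    return -1.196 + 0.49927 * x1 + 2.5986 * x2 + 2.3084 * x3 + 0.39217 * x4


# Hypothetical abundances, chosen only to exercise the formula.
score = predict_quality_score(0.5, 0.5, 0.5, 0.5)
print(round(score, 5))  # 1.70322
```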


5. CONCLUSION 

Various studies have shown that the shortcomings of quality classification by conventional
techniques can affect the grading system. This paper presented the development of a pre-analysis
of agarwood oil classification using stepwise regression. The study successfully applied a
stepwise regression model to predict the agarwood oil quality classification. Four compounds were
selected as input data, covering low, medium low, medium high and high quality agarwood oil. The
model achieved a coefficient of determination, R², of 0.756. The findings indicate that the
agarwood oil data obtained a best-fit stepwise regression, with R² below 0.8. The overall p-value,
3.27 × 10⁻¹⁹⁹, is also below 0.05. The findings can be evaluated and used as a reference for
agarwood oil quality grading in the future.